English-Chinese  Information  Retrieval  at  IBM 


Martin  Franz,  J.  Scott  McCarley,  Wei-Jing  Zhu 
IBM  T.J.  Watson  Research  Center 
P.O. Box 218
Yorktown Heights, NY 10598


Abstract 

We describe TREC-9 experiments with an IR system that incorporates
statistical machine translation trained on sentence-aligned parallel
corpora for both query translation (English ⇒ Chinese) and document
translation (Chinese ⇒ English). These systems are contrasted with
monolingual Chinese retrieval and with query translation based on a widely
available commercial machine translation package. These systems incorporate
both words and characters as features for the retrieval. Comparisons with a
baseline from TREC-5/6 enable our experiments to address issues related
to the differences between Beijing and Hong Kong dialects.


1  Chinese  preprocessing 

The TREC-5/6 corpus is in the Beijing dialect of Chinese, and is encoded in
the GB-2312 character set. The TREC-9 corpus consists of news stories from
Hong Kong, and is encoded in the Big-5 character set. In order to perform
comparable experiments on both corpora, we adopt UTF-8 encoded Unicode
as our internal representation of Chinese characters. In order to study
baseline retrieval performance, we converted the TREC-5/6 Chinese track
corpus from GB to Unicode. We converted the TREC-9 corpus from Big-5 to
Unicode (ignoring the “extra” HKSAR hanzi). We note that Unicode often
encodes the simplified and traditional forms of the same hanzi at different
code points; the mappings relating the simplified and traditional forms, as
well as other semantic variants within Unicode, are well documented [1]. Any
character that could be linked to a simplified Chinese character (including
indirect linkings) was mapped to that character; simplified characters linked
to each other were mapped to the one with the smaller Unicode code point.
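
The variant folding described above can be sketched as a union-find over
variant links. This is only an illustration under stated assumptions: the
names build_canonical_map and normalize, the variant_pairs input, and the
simplified set are all hypothetical, and the real mapping tables would come
from the Unicode variant data cited in [1].

```python
def build_canonical_map(variant_pairs, simplified):
    """Map each linked character to a canonical simplified form.

    variant_pairs: iterable of (char_a, char_b) variant links (hypothetical
        input, e.g. derived from a Unicode variant table).
    simplified: set of characters known to be simplified forms.
    """
    parent = {}

    def find(ch):
        # Union-find lookup with path halving.
        parent.setdefault(ch, ch)
        while parent[ch] != ch:
            parent[ch] = parent[parent[ch]]
            ch = parent[ch]
        return ch

    # Merging linked characters makes indirect linkings transitive.
    for a, b in variant_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b

    groups = {}
    for ch in list(parent):
        groups.setdefault(find(ch), []).append(ch)

    canonical = {}
    for members in groups.values():
        simplified_members = [m for m in members if m in simplified]
        if not simplified_members:
            continue  # no simplified form reachable: leave characters as-is
        # Several linked simplified forms: keep the smallest code point.
        target = min(simplified_members, key=ord)
        for m in members:
            canonical[m] = target
    return canonical


def normalize(text, canonical):
    """Replace every character that has a canonical form."""
    return "".join(canonical.get(ch, ch) for ch in text)
```

Characters with no link to any simplified form pass through unchanged, which
is an assumption of this sketch rather than something the text specifies.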




2  Chinese  IR  System  Description 

The Chinese IR track in TREC-5/6 triggered extensive experimentation on
whether Chinese characters should be automatically tokenized (“segmented”)
into words to use as features for IR, or whether the characters themselves (and
n-grams of characters) should be used as tokens for IR. No clear consensus has
emerged [2, 3] (see also [4, 5]), although there are good reasons to prefer
shorter words (limited to fewer than about 4 characters) [6], as well as to
incorporate both types of features. Our approach to incorporating both words
and characters is to build two separate systems, closely modeled on our English
IR system [7], and to merge the results by linear combination of scores.
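
A minimal sketch of merging two result lists by linear combination of scores
follows. The function name merge_scores, the weight parameter, and the choice
to treat a document missing from one list as scoring zero are assumptions made
for illustration; the text does not state the weights or the handling of
missing documents.

```python
def merge_scores(char_scores, word_scores, weight=0.5):
    """Linearly combine two {doc_id: score} maps and rank the documents.

    weight is the (hypothetical) share given to the character-based system;
    the word-based system receives 1 - weight. A document absent from one
    system is assumed to score 0.0 there.
    """
    doc_ids = set(char_scores) | set(word_scores)
    merged = {
        doc_id: weight * char_scores.get(doc_id, 0.0)
        + (1.0 - weight) * word_scores.get(doc_id, 0.0)
        for doc_id in doc_ids
    }
    # Highest score first; ties broken by document id for determinism.
    return sorted(merged.items(), key=lambda item: (-item[1], item[0]))
```

In practice the two systems' scores may need to be brought onto a comparable
scale before combining; that step is omitted here.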

Both  corpora  were  segmented  with  a  statistical  segmenter  similar  to  the  one 
discussed  in  [8].  The  corpus-based  iterative  approach  to  Chinese  segmentation 
allowed  us  to  customize  the  segmenter’s  language  model  probabilities  to  each 
corpus.  The  segmenter’s  vocabulary  consisted  mostly  of  two-character  words, 
with  no  words  exceeding  5  characters,  since  there  is  evidence  that  short  words  are 
preferable  (longer  words  often  fail  to  match  any  terms  in  queries)  for  information 
retrieval  purposes. 
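
To make the effect of a short-word vocabulary concrete, here is a small
dynamic-programming segmenter. It is a simplified stand-in, not the segmenter
of [8]: it assumes a unigram model given as a hypothetical word_logprob
dictionary, a max_len cap on word length, and a fixed unknown_logprob penalty
for out-of-vocabulary single characters.

```python
import math


def segment(text, word_logprob, max_len=5, unknown_logprob=-20.0):
    """Split text into the most probable word sequence under a unigram model.

    word_logprob: {word: log probability} (hypothetical vocabulary).
    max_len: longest word considered, mirroring a short-word vocabulary.
    unknown_logprob: assumed penalty for a single unknown character.
    """
    n = len(text)
    best = [0.0] + [-math.inf] * n  # best[i]: best log score of text[:i]
    back = [0] * (n + 1)            # back[i]: start index of the last word
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            word = text[start:end]
            if word in word_logprob:
                logprob = word_logprob[word]
            elif len(word) == 1:
                logprob = unknown_logprob  # fall back to a lone character
            else:
                continue
            score = best[start] + logprob
            if score > best[end]:
                best[end] = score
                back[end] = start
    # Follow the back-pointers from the end of the text.
    words = []
    end = n
    while end > 0:
        start = back[end]
        words.append(text[start:end])
        end = start
    return words[::-1]
```

Re-estimating word_logprob from the segmenter's own output and repeating would
give a corpus-specific model in the spirit of the iterative approach mentioned
above, though the details here are assumed.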


Figure 1: Diagram of system (first-pass retrieval at (A) and (B), merged
at (C); query expansion at (D) and (E), merged at (F))


Our Chinese (monolingual) IR system is a two-pass system, in which the
results of the initial retrieval are used to construct an expanded query, which
is then used for a second-pass retrieval. The outline of the system is shown
in Fig. 1. For generality of explanation, we assume that the query has
already been preprocessed into two forms: one in which it has been
automatically tokenized into short words for use as IR features, and one in
which each character is a separate token. The first-pass scoring is based on
the Okapi formula [9], using characters as features in (A) and short-word
tokens as features in (B). The results are merged at (C) by linear combination
of scores. Our query expansion is based on LCA [10], which selects features
from the top-ranked documents (output at (C)) that frequently co-occur with
query features in these documents. At (D) we expand the character-based
representation, i.e., we look for characters that frequently co-occur with
query characters in the top-ranked documents. At (E) we likewise expand the
short-word-based representation. The results of both query expansions are
merged at (F) to yield the final results.
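
The two passes can be sketched for a single feature type as follows. Both
functions are illustrative assumptions: okapi_scores uses one common BM25
variant with default k1 and b values that the text does not specify, and
expand_query is a crude document-level co-occurrence heuristic standing in
for LCA [10], not an implementation of it.

```python
import math
from collections import Counter


def okapi_scores(query, docs, k1=1.2, b=0.75):
    """Score documents with a common BM25 form of the Okapi formula.

    query: list of tokens (characters or short words).
    docs: non-empty {doc_id: list of tokens}. k1 and b are assumed defaults.
    """
    n_docs = len(docs)
    avg_len = sum(len(tokens) for tokens in docs.values()) / n_docs
    doc_freq = Counter()
    for tokens in docs.values():
        doc_freq.update(set(tokens))
    scores = {}
    for doc_id, tokens in docs.items():
        term_freq = Counter(tokens)
        score = 0.0
        for term in query:
            tf = term_freq[term]
            if tf == 0:
                continue
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            length_norm = k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf * (k1 + 1) / (tf + length_norm)
        scores[doc_id] = score
    return scores


def expand_query(query, docs, scores, top_docs=2, new_terms=2):
    """Add terms that co-occur with query terms in the top-ranked documents."""
    ranked = sorted(scores, key=lambda d: (-scores[d], d))[:top_docs]
    query_terms = set(query)
    cooccurrence = Counter()
    for doc_id in ranked:
        term_freq = Counter(docs[doc_id])
        query_mass = sum(term_freq[q] for q in query_terms)
        for term, count in term_freq.items():
            if term not in query_terms:
                cooccurrence[term] += count * query_mass
    best = sorted(cooccurrence.items(), key=lambda item: (-item[1], item[0]))
    return list(query) + [term for term, _ in best[:new_terms]]
```

Running okapi_scores once with character tokens and once with short-word
tokens, expanding each query, and rescoring would mirror steps (A), (B), (D),
and (E); the merging steps at (C) and (F) are not shown here.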


3  Crosslingual  IR  Experiments

3.1  Query  translation  with  a  statistical  model 

We used two parallel corpora (Hong Kong Laws and Hong Kong News, available
from the Linguistic Data Consortium as part of the Topic Detection and
Tracking (TDT) project [12]), and a smaller amount of material from the FBIS,
to build a character-based statistical translation model. Because the
majority of the parallel text was from Hong Kong, we expect this translation
model to be particularly well matched to the TREC-9 test set, and less well
suited to the TREC-5/6 baseline (in contrast to the commercial translation
package described in Section 3.3). We built a model of the probability
$p(c \mid E, E_+, E_-)$, where $c$ is a Chinese character, $E$ is an English
word, and $E_+$ and $E_-$ are the nearest following and preceding content
words. Models of this structure have previously been described in [13]. These
models are trained from a sentence-aligned parallel corpus, together with a
word alignment. The word alignment is constructed automatically from the
parallel corpus using a Poisson-fertility model, as described in [14, 15].
This model predicts only characters, not the order of characters. The
ordering was imposed when possible by dictionary lookup, using a
Chinese-English dictionary made available for the TDT project. We note that
ordering the characters was not necessary for the character-based aspect of IR,
but was necessary in order to segment the query into words (and hence for the
word-based aspect of IR). Experiments with this model are denoted QT in the
results.
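
To illustrate how a context-conditioned character model yields an unordered
bag of query characters, the sketch below estimates the distribution by
relative frequency with a back-off to the English word alone. This is a
deliberately simpler technique than the models of [13] and the
Poisson-fertility alignment of [14, 15]; the event tuples, function names,
back-off rule, and top_k cutoff are all hypothetical.

```python
from collections import Counter, defaultdict


def train(events):
    """Count aligned events (char, word, next_word, prev_word).

    next_word and prev_word are the nearest following and preceding content
    words, or None at a sentence boundary (an assumed convention).
    """
    context_counts = defaultdict(Counter)
    word_counts = defaultdict(Counter)
    for char, word, next_word, prev_word in events:
        context_counts[(word, next_word, prev_word)][char] += 1
        word_counts[word][char] += 1
    return context_counts, word_counts


def char_distribution(model, word, next_word, prev_word):
    """Relative-frequency estimate, backing off to the word alone."""
    context_counts, word_counts = model
    counts = (
        context_counts.get((word, next_word, prev_word))
        or word_counts.get(word)
        or Counter()
    )
    total = sum(counts.values())
    return {char: n / total for char, n in counts.items()} if total else {}


def translate_query(model, content_words, top_k=2):
    """Return an unordered bag of characters for a list of content words."""
    bag = []
    for i, word in enumerate(content_words):
        prev_word = content_words[i - 1] if i > 0 else None
        next_word = content_words[i + 1] if i + 1 < len(content_words) else None
        dist = char_distribution(model, word, next_word, prev_word)
        ranked = sorted(dist.items(), key=lambda item: (-item[1], item[0]))
        bag.extend(char for char, _ in ranked[:top_k])
    return bag
```

The output is only a bag of characters, which matches the observation above
that character order has to be recovered separately before word segmentation.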

3.2  Document  Translation  with  a  Statistical  Model 

Because of our prior success mixing query translation and document translation
[16], we also built a Chinese ⇒ English translation model from the same
parallel corpora as above. This model is not directly comparable to the
statistical query translation model: it is a word-based model for
$p(E \mid C)$, the probability of an English word (morphological root) $E$
given a Chinese word $C$ (determined from an automatic segmentation of the
corpus). This model is also a Poisson-fertility model. When the corpus was
translated, the translation model was supplemented by the LDC dictionary.
Since the resulting translation of the corpus is English, there is no
character/word distinction in the IR system associated with this retrieval.
Results with this model are denoted DT in the results.

3.3  Query  translation  using  commercial  software 

Another set of experiments involved translating the English version of the
query using a widely available commercial machine translation package [11].
These experiments are denoted TW in the results. Since this software was
developed in Taiwan, we expected that it would be more closely matched to the
TREC-5/6 baseline than to the TREC-9 test set. The output of the translation
package, which consists of unsegmented characters, is automatically segmented
into words and characters and then used in our Chinese IR system.


4  Discussion  of  Results 

The character-based half of the system generally outperformed the word-based
half of the system, both monolingually and across both types of
English ⇒ Chinese query translation, especially on the first pass of scoring.
Query expansion made the differences between character- and word-based
retrieval less clear. The gain from mixing character-based and word-based
results was only slight. This result seems to hold for both the TREC-5/6 set
and the TREC-9 set, and so is probably independent of dialect. On the other
hand, dialect strongly influenced the relative behavior of the two query
translation systems. The Taiwan-built commercial system, as expected,
performed better on the TREC-5/6 task (Beijing data), whereas the statistical
system, trained on Hong Kong data, performed better on the TREC-9 task
(Hong Kong corpus).

Our submission system was a merging of the TW, QT, and DT systems.
However, the relative ranks of the TW, QT, and DT systems are completely
reversed between TREC-5/6 and TREC-9, presumably mostly as a result of
dialect differences. Thus TREC-5/6 could not be used as a training set
for predicting merging weights, etc., for TREC-9. However, in both sets
(unlike many other IR tasks) the value of merging the results of different
systems is questionable.